<h1> <center>ANALYZING TWITTER DATA TO IDENTIFY WORKPLACE RELATED ISSUES IN REAL TIME </center></h1>

<img src="https://github.com/ssuleyma/SocialMediaAnalysis/blob/master/Twitter-Workplace-Issues/Photos/animation.gif?raw=true/" alt="Dashboard" style="width:1000px;height:500px;">

### Table of contents
**Note:** Open the instructions in a new tab; the document does not load with a regular left click. <br>
Follow the [instructions](https://ibm.box.com/s/1t4ism8ru8bdsu3q272sqlgknxys6mju) to run the notebook. <br>
This notebook is divided into the following parts:

[Part 1: Setup](#setup)<br>
[Part 2: Accessing Twitter API and Scraping the Data](#access)<br>
[Part 3: Cleaning the Twitter Data](#clean)<br>
[Part 4: Watson Discovery | NLU](#watson)<br>
[Part 5: Analyzing Enriched Data](#analyze)<br>
[Part 6: Create a Watson Knowledge Studio Model](#wks)<br>
[Part 7: Custom Model](#custom)<br>
[Part 8: Visualizing the Results on World Map](#visualize)<br>

<a id="setup"></a>
# 1. Setup

**NOTE:** We need a project token because it lets us import and export assets from our project, for example the CSV file we uploaded earlier as a project asset.

### INSERT PROJECT TOKEN
1. Go to the 3-dots and select insert project token
2. Follow the error message link to the project settings page and create a new key
3. Come back to the notebook and hit the "insert project token" option again
4. Scroll to the top and run the new cell with your project token

## 1.1. Importing libraries

**NOTE:** We need to import these libraries in order to run the notebook.<br>

1. NumPy - NumPy is a Python package for scientific computing. Its core type, the ndarray (NumPy array), is a multidimensional array that stores values of a single data type. We use it for operations on the dataframe we create (twitter_data).

2. pandas - pandas is a Python library for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. We use it to convert our CSV asset into a dataframe (twitter_data).

3. json - The json module converts between Python objects (such as dictionaries) and JSON strings, and reads and writes JSON files directly. We use it to write each tweet to its own JSON file.

4. re - This module provides regular expression matching operations. A regular expression (or RE) specifies a set of strings that match it; the functions in this module let you check whether a particular string matches a given regular expression.

5. os - The os module provides a way of using operating system dependent functionality, such as creating directories and listing files, from Python.

6. datetime and time - The datetime module provides the date, time and datetime classes for working with dates, times and time intervals. These are objects, so when you manipulate them you are working with objects rather than strings or raw timestamps.

7. GeoText - GeoText extracts country and city mentions from text.

8. twitter - Used to access the Twitter API and its OAuth authentication.

9. ibm-watson - Used to connect to the Watson services used in this notebook, for example Discovery.

In [None]:
import numpy as np

!pip install --upgrade pandas
import pandas as pd

import json
import re

import os
import datetime
import time

!pip install geotext
from geotext import GeoText
import geotext

!pip install twitter
import twitter

!pip install --upgrade ibm-watson

# 2/3. Alternative - Import Workshop Twitter Data

<img src="https://github.com/BKDuncan/Tweet-Analysis/blob/master/images/Step%202%203.PNG?raw=true" alt="step 2/3">

### For this workshop we prepared data for you, so you don't need to sign up for a Twitter developer account. You can run this cell and skip steps 2 and 3.

**NOTE:** The cell below uses the project token to load the CSV data into the variable my_file. The data is then converted into a dataframe (twitter_data), and we display its first 20 rows.

In [None]:
# Fetch the file
my_file = project.get_file("workplace_tweets_prepared_data.csv")

# Read the CSV data file from the object storage into a pandas DataFrame
my_file.seek(0)
twitter_data = pd.read_csv(my_file, nrows=1000)
twitter_data = twitter_data.replace(np.nan, '')

twitter_data.head(20)

<a id="access"></a>
# 2. Accessing Twitter API and Scraping the Data

### Follow the instructions document to make a Twitter Developer Account, Twitter App, and Generate Keys.

In [None]:
# Go to https://developer.twitter.com/en/apps to create an app and get values for these credentials.
# Replace each placeholder string below with your own value to access the Twitter API.
access_token = "paste your token here"
access_token_secret = "paste your token secret here"
consumer_key = "paste your consumer key here"
consumer_secret = "paste your consumer secret here"

In [None]:
# See https://developer.twitter.com/en/docs for more information on Twitter's OAuth implementation.
auth = twitter.oauth.OAuth(access_token, access_token_secret,consumer_key,consumer_secret)
twitter_api = twitter.Twitter(auth=auth)

# Set this list to the search terms you want to track.
# The workplace-related terms and hashtags below are used throughout the remainder of this notebook.
queries = ['worker','workplace','workersrights','employer', 'employee','employment','employmentlaw',
           '#worker','#workplace','#workersrights','#employer', '#employee','#employment','#employmentlaw']
count = 100

statuses = []
query_text = []

for q in queries:
    search_results = twitter_api.search.tweets(q=q,lang='en',count=count,tweet_mode="extended")['statuses']
    
    statuses.extend(search_results)
    query_text.extend([q]*len(search_results))

<a id="clean"></a>
# 3. Cleaning the Twitter Data

In [None]:
# Select the valuable fields
status_texts = [ {'tid': status['id'],'text':status['full_text'],'time': status['created_at'],
                  'lang':status['lang'],'location':status['user']['location'],'place':status['place'],
                  'source':status['source'],'retweeted':status['retweet_count'], 
                  'user': status['user']['screen_name']} for status in statuses ]

# Create a data frame
twitter_data = pd.DataFrame(data=status_texts)

# Extract the country if place is not empty
twitter_data['place'] = twitter_data['place'].apply(lambda l: None if l is None else l['country'])

# Extract time 
twitter_data["time"] = twitter_data['time'].apply(lambda dt: dt.split(" ")[1] + ", " + dt.split(" ")[2] + ", " + dt.split(" ")[5] + ", " + dt.split(" ")[3])
twitter_data["time"] = twitter_data["time"].apply(lambda s: datetime.datetime.strptime(s, '%b, %d, %Y, %H:%M:%S'))

# Extract source link
twitter_data['source'] = twitter_data["source"].apply(lambda s: s.split(" ")[1][5:])

# Remove the user name of retweet
twitter_data["text"] = twitter_data["text"].apply(lambda s: re.sub(r"^RT @.{0,20}:","",s) if re.match(r"^RT @.{0,20}:",s) else s)
twitter_data["text"] = twitter_data["text"].apply(lambda s: s.strip())

# Add the query text to the dataframe
twitter_data['q_text'] = query_text

# Remove duplicate tweets, e.g. retweets whose text is identical once the RT prefix is stripped
twitter_data = twitter_data[~twitter_data['text'].duplicated()]

# Reset index and rename columns
twitter_data.reset_index(drop=True,inplace=True)
twitter_data.rename(columns={'time':'datetime'}, inplace=True)

In [None]:
twitter_data.head()

In [None]:
twitter_data.info()
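
Two of the text-level steps in this part, parsing Twitter's `created_at` string and removing the retweet prefix, can be tried on their own. The sketch below is a simplified variant, not the notebook's exact code: it parses the timestamp with a single `strptime` format (which keeps the UTC offset, unlike the split-and-rejoin approach in the cleaning cell), and the helper names and sample strings are made up for illustration.

```python
import datetime
import re

# Format of the created_at field, e.g. "Wed Oct 10 20:19:24 +0000 2018"
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"

# Same retweet pattern as the cleaning cell: "RT @", up to 20 characters, then a colon
RETWEET_PREFIX = re.compile(r"^RT @.{0,20}:")


def parse_created_at(value):
    """Parse a created_at string into a timezone-aware datetime."""
    return datetime.datetime.strptime(value, TWITTER_TIME_FORMAT)


def strip_retweet_prefix(text):
    """Drop a leading 'RT @user:' marker and surrounding whitespace."""
    return RETWEET_PREFIX.sub("", text).strip()


parsed = parse_created_at("Wed Oct 10 20:19:24 +0000 2018")
cleaned = strip_retweet_prefix("RT @acme_hr: Unsafe conditions reported at the plant")
print(parsed.isoformat())
print(cleaned)
```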

<a id="watson"></a>
# 4. Watson Discovery | Natural Language Understanding (NLU)

<img src="https://github.com/BKDuncan/Tweet-Analysis/blob/master/images/Step%204.PNG?raw=true" alt="step 4">

### Follow the instructions document to create your Watson Discovery Service before continuing.

**NOTE:** The batch variable is a label used to name the local folders that hold the tweets we send to Discovery.

In [None]:
batch = "batch_one"

## 4.1. Creating JSON for each tweet

**NOTE:** The code cell in this section converts each tweet into a JSON object, because we upload the tweets to Discovery as JSON documents. It creates a local directory named "tweets_" followed by the batch label and writes one JSON file per tweet into it.
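
`json.dump` calls its `default` hook only for values it cannot encode on its own, which is how the cell in this section handles NumPy integers and timestamps. A minimal standalone sketch of that mechanism, using only the standard library (the `fallback` function and the sample record are illustrative, not taken from the notebook):

```python
import datetime
import json


def fallback(value):
    """Called by json only for objects it cannot serialize natively."""
    if isinstance(value, (set, frozenset)):
        return sorted(value)   # sets become sorted lists
    return str(value)          # everything else (e.g. datetime) becomes a string


record = {
    "tid": 1050118621198921728,   # plain ints never reach the fallback
    "datetime": datetime.datetime(2019, 3, 25, 12, 0, 0),
    "tags": {"workplace", "employer"},
}

encoded = json.dumps(record, default=fallback)
print(encoded)
```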

In [None]:
## make NumPy int64 values JSON serializable; anything else json cannot encode (such as the datetime column) is written as a string
def default(o): 
    if isinstance(o, np.int64): return int(o)  
    else: return str(o)

## creating tweets directory    
if not os.path.isdir('tweets_'+ batch):
    os.mkdir('tweets_'+batch)

## creating json file per tweet
for i in twitter_data.index:
    with open(("./tweets_{}/tweet_{}.json".format(batch,i)),"w") as outfile:
        json.dump({"text": twitter_data.loc[i,"text"],"datetime": twitter_data.loc[i,"datetime"],
                   "tid": twitter_data.loc[i,"tid"],"user": twitter_data.loc[i,"user"], "q_text": twitter_data.loc[i,"q_text"],
                   "source": twitter_data.loc[i,"source"],
                   "location": twitter_data.loc[i,"location"],"place": twitter_data.loc[i,"place"]}, outfile, default=default) 

## 4.2. Initiating Watson Discovery

Before running the cells below, complete the corresponding steps in the instructions (you'll need a Watson Discovery collection).

**NOTE:** This cell sets up the connection to Discovery using the ibm-watson package. Enter the API key, environment ID and collection ID from your Discovery service.

In [None]:
# Setup Discovery API so we can interact with it using this Notebook
from ibm_watson import DiscoveryV1

discovery = DiscoveryV1(version='2019-03-25',
                        iam_apikey="YOUR_API_KEY", # Enter your credentials
                        url="https://gateway.watsonplatform.net/discovery/api")

env_id = 'YOUR_ENVIRONMENT_ID' # Enter your credentials
col_id = 'YOUR_COLLECTION_ID' # Enter your credentials

discovery.set_default_headers({'x-watson-learning-opt-out': "true"})

**NOTE:** Here we upload the tweets to the Discovery service.

In [None]:
# Upload each tweet to Watson Discovery so it can "enrich" it (meaning it will identify the categories, entities, and sentiment of the tweet).
# For further analysis we could also include the Watson Tone Analyzer to identify emotions, Watson Personality Insights to identify the user's traits, etc.
for ind, file_object in enumerate(os.listdir("./tweets_{}/".format(batch))):
    if file_object.endswith(".json"):
        # Read the file inside a context manager so the handle is closed after each upload
        with open(os.path.join("./tweets_{}/".format(batch),file_object)) as f:
            document_json = f.read()
        doc_id = "tweet" + str(ind)
        add_doc = discovery.update_document(environment_id=env_id, collection_id=col_id, document_id = doc_id,file = document_json, 
                                            file_content_type = "application/json", filename = file_object).get_result()
    else:
        pass
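
One detail worth knowing about the upload loop: `os.listdir` returns file names in arbitrary order, so document IDs built from the loop counter are not guaranteed to map to the same tweet on every run. The sketch below shows one possible alternative, under the assumption that deriving the ID from the file name is acceptable for your collection; `iter_json_documents` is a hypothetical helper and nothing here calls Discovery.

```python
import json
import os
import tempfile


def iter_json_documents(directory):
    """Yield (doc_id, filename, text) for every .json file, in sorted order."""
    for filename in sorted(os.listdir(directory)):
        if not filename.endswith(".json"):
            continue
        path = os.path.join(directory, filename)
        with open(path) as handle:      # the file is closed as soon as it is read
            text = handle.read()
        doc_id = os.path.splitext(filename)[0]   # e.g. "tweet_0"
        yield doc_id, filename, text


# Build a throwaway folder with two tweets and one unrelated file
with tempfile.TemporaryDirectory() as folder:
    for i, text in enumerate(["first tweet", "second tweet"]):
        with open(os.path.join(folder, "tweet_{}.json".format(i)), "w") as out:
            json.dump({"text": text}, out)
    with open(os.path.join(folder, "notes.txt"), "w") as out:
        out.write("ignored")

    documents = list(iter_json_documents(folder))

print([doc_id for doc_id, _, _ in documents])
```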

In [None]:
discovery.get_collection(env_id, col_id).get_result()

<a id="analyze"></a>
# 5. Analyzing Enriched Data

<img src="https://github.com/BKDuncan/Tweet-Analysis/blob/master/images/Step%205.PNG?raw=true" alt="step 5">

**NOTE:** The cell below builds the comma-separated list of fields we want returned from the enriched documents stored in Discovery.

In [None]:
# Select fields to extract from enriched data
flds = ['tid','user','text','place','location','datetime',
        'enriched_text.sentiment.document.label','enriched_text.sentiment.document.score',
        'enriched_text.categories.label','enriched_text.categories.score',
        'enriched_text.entities.type','enriched_text.entities.relevance','enriched_text.entities.text',
        'enriched_text.entities.sentiment.label','enriched_text.entities.sentiment.score']
flds = ','.join(flds)
flds

**NOTE:** The cell below queries Discovery for those fields and saves the results in a dataframe named enriched_data.

In [None]:
# Obtain enriched data from Watson Discovery
query_result = discovery.query(environment_id=env_id, collection_id=col_id, return_fields=flds, count=1000).get_result()['results']
enriched_data = pd.DataFrame(query_result)

**NOTE:** The cell below unpacks the enrichments that Discovery added to each tweet (categories, entities and sentiment) into separate columns.

In [None]:
# Clean the dataframe
enriched_data['categories'] = enriched_data['enriched_text'].apply(lambda d: d.pop('categories') if 'categories' in d else np.nan)
enriched_data['entities'] = enriched_data['enriched_text'].apply(lambda d: d.pop('entities') if 'entities' in d else np.nan)
enriched_data['sentiment'] = enriched_data['enriched_text'].apply(lambda x: x.pop('sentiment')['document'] if 'sentiment' in x else np.nan)

enriched_data.drop(columns={'enriched_text','id','result_metadata'},inplace=True)

enriched_data['sentiment_score'] = enriched_data['sentiment'].apply(lambda x: x.pop('score') if not isinstance(x, float) else 0)
enriched_data['sentiment'] = enriched_data['sentiment'].apply(lambda x: x.pop('label') if not isinstance(x, float) else 'neutral')
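
The unpacking above is easier to follow on a single record. The sketch below mimics it with plain dictionaries; the sample records are invented, and their shape is only assumed from the field names requested in the query (`enriched_text.sentiment.document.label` and so on).

```python
def unpack_enrichment(result):
    """Flatten one query result into the columns used in the rest of the notebook."""
    enriched = result.get("enriched_text", {})
    sentiment = enriched.get("sentiment", {}).get("document", {})
    return {
        "tid": result.get("tid"),
        "text": result.get("text"),
        "categories": enriched.get("categories"),   # None when the tweet was not categorized
        "entities": enriched.get("entities"),
        "sentiment": sentiment.get("label", "neutral"),   # same defaults as the cell above
        "sentiment_score": sentiment.get("score", 0),
    }


sample_results = [
    {
        "tid": 1,
        "text": "Workers injured at the factory again",
        "enriched_text": {
            "sentiment": {"document": {"label": "negative", "score": -0.8}},
            "categories": [{"label": "/society/work", "score": 0.9}],
            "entities": [],
        },
    },
    {"tid": 2, "text": "ok", "enriched_text": {}},   # too short to be enriched
]

rows = [unpack_enrichment(r) for r in sample_results]
for row in rows:
    print(row["tid"], row["sentiment"], row["sentiment_score"])
```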

## 5.1. Applying First Filter to Extract Non-positive Tweets

In [None]:
# First filter: non-positive tweets. We are interested in the neutral/negative ones for our study.
e_data_filtered = enriched_data.copy()[enriched_data['sentiment'] != 'positive'].reset_index(drop=True)

In [None]:
print("There are {} non-positive tweets.".format(len(e_data_filtered)))

In [None]:
e_data_filtered.head(5)

## 5.2. Applying Second Filter to Extract Relevant Tweets by Category and Content

In [None]:
# Second filter: by relevant categories and words. We want to remove any unrelated tweets.
def categories_filter(ct):
    relevant_cats_one = set(['/society/work'])
    relevant_cats_two = set(['/society/work/unemployment','/society/crime/sexual offence','/society/crime/personal offense',
                             '/society/welfare/healthcare','/society/unrest and war',
                             '/business and industrial/construction','/religion and spirituality/islam'])
    sports = re.compile("(auto)")
    irrelevant_cats = re.compile("(careers)|(education)|(robotics)|(investing)|(shopping)|(family)|(travel)|(science)|(health and fitness)|(art )")
    irrelevant_cats_one = re.compile("(careers)|(sports)|(education)|(robotics)|(investing)|(shopping)|(family)|(travel)|(science)|(health and fitness)|(art )")
    irrelevant_cats_two = re.compile("(careers)|(sports)|(education)|(art )|(news)|(plans)|(robotics)|(investing)|(shopping)|(family)|(travel)|(business operations)|(health and fitness)")
    
    if (any(i in ct for i in relevant_cats_one))&(sports.search(" ".join(ct)) != None)&(irrelevant_cats.search(" ".join(ct)) == None):
        return True
    elif (any(i in ct for i in relevant_cats_one))&(irrelevant_cats_one.search(" ".join(ct)) == None):
        return True
    elif (any(i in ct for i in relevant_cats_two))&(irrelevant_cats_two.search(" ".join(ct)) == None):
        return True
    else:
        return False
    
def stop_words_filter(t):
    irrelevant_words = ['jobsearch', 'interview', 'apply now','stock','employment law', 'employmentlaw', 'legislation',
                        'research', 'study','survey','reforms','nigga','tweet','whoopee','announcement','read here','news','government',
                        'trump','brexit','hard worker','culture','leadership','president','regulat','federal','for more on ','blog post',
                        'need a job','tax','hr']
    irre_words_one= re.compile("|".join(irrelevant_words),re.IGNORECASE)
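    # re.DOTALL replaces the inline "(?s)" flags, which Python 3.11+ rejects when they are not at the start of a pattern
    irre_words_two = re.compile(r"(how to)|(how do)|(tips for)|(tips to)|(under (.*) act)|(text [\[0-9]+)|(job(.*) wanted)", re.IGNORECASE | re.DOTALL)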
    
    re_words=re.compile("discrimination|protest|fatalindustrialinjury|injur|factory",re.IGNORECASE)
    
    if re_words.search(t):
        return True
    elif irre_words_one.search(t) or irre_words_two.search(t):
        return False
    else:
        return True
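
The word filter gives priority to the "relevant" pattern: a tweet that mentions, say, an injury is kept even if it also contains a stop word. A reduced sketch of that precedence, with shortened word lists chosen only for illustration:

```python
import re

# Shortened, illustrative lists; the notebook's lists are longer
KEEP = re.compile("protest|injur|factory", re.IGNORECASE)
DROP = re.compile("interview|apply now|how to", re.IGNORECASE)


def is_relevant(text):
    """Return True when the tweet should be kept."""
    if KEEP.search(text):
        return True     # relevant words win over stop words
    if DROP.search(text):
        return False    # stop word and no relevant word
    return True         # nothing matched: keep by default


examples = [
    "How to report an injury at work",
    "How to ace your next interview",
    "My manager ignored the complaint",
]
decisions = [is_relevant(t) for t in examples]
print(decisions)
```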

In [None]:
# Drop Tweets that didn't have enough data to be enriched (NaN). These are usually tweets that had very few words or nouns for Discovery to interpret.
for index, item in enumerate(e_data_filtered['categories']):
    if(not isinstance(item, list)):
        print("DROP DATA:")
        print(index, item)
        for col in e_data_filtered:  # "col" avoids shadowing the built-in type()
            print(str(col) + ': ' + str(e_data_filtered[col][index]))
        e_data_filtered = e_data_filtered.drop(index)

In [None]:
e_data_filtered['cat_labels'] = e_data_filtered['categories'].apply(lambda l: set([i['label'] for i in l]))
e_data_filtered['cat_relevant'] = e_data_filtered['cat_labels'].apply(categories_filter)
e_data_filtered['word_relevant'] = e_data_filtered['text'].apply(stop_words_filter)
e_data_filtered = e_data_filtered[e_data_filtered['cat_relevant'] & e_data_filtered['word_relevant']].reset_index(drop=True)

In [None]:
print("There are about {} relevant tweets among non-positive tweets.".format(len(e_data_filtered)))

In [None]:
e_data_filtered.head(5)

## 5.3. Applying Third Filter to Extract Tweets with Category Confidence Score Above 70%

In [None]:
# Third filter: category confidence score filtering
score_cats = ['/society/work','/business and industrial/business operations/human resources/compensation and benefits','law, govt and politics']
e_data_filtered['work_score'] = e_data_filtered['categories'].apply(lambda l: np.sum([i['score'] if i['label'] == score_cats[0] else 0 for i in l]))
e_data_filtered['bene_score'] = e_data_filtered['categories'].apply(lambda l: np.sum([i['score'] if i['label'] == score_cats[1] else 0 for i in l]))
e_data_filtered['law_score'] = e_data_filtered['categories'].apply(lambda l: np.sum([i['score'] if i['label'] == score_cats[2] else 0 for i in l]))

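# Keep tweets scored at 70% or more in any of the three categories, plus tweets with no '/society/work' score
e_data_filtered = e_data_filtered[(e_data_filtered['work_score'] >= 0.7)|(e_data_filtered['bene_score'] >= 0.7)|
                                  (e_data_filtered['law_score'] >= 0.7)|(e_data_filtered['work_score'] == 0)]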
# Dropping unnecessary columns and resetting index
e_data_filtered.drop(columns=['categories','cat_relevant','word_relevant'],inplace=True)
e_data_filtered.reset_index(drop=True,inplace=True)
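
The third filter sums, per tweet, the confidence scores of a few category labels and keeps the tweet if a score reaches 0.7, or if it has no `/society/work` score at all. A plain-Python sketch of that rule on invented category lists (the helper names are hypothetical, and only the work score is checked here to keep it short):

```python
def label_score(categories, label):
    """Sum the scores of all categories whose label equals `label`."""
    return sum(c["score"] for c in categories if c["label"] == label)


def passes_threshold(categories, threshold=0.7):
    work = label_score(categories, "/society/work")
    # Keep confident work tweets, and tweets that were never scored as work
    return work >= threshold or work == 0


tweets = {
    "confident": [{"label": "/society/work", "score": 0.85}],
    "uncertain": [{"label": "/society/work", "score": 0.40}],
    "other": [{"label": "/society/unrest and war", "score": 0.60}],
}
kept = [name for name, cats in tweets.items() if passes_threshold(cats)]
print(kept)
```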

In [None]:
print("There are {} relevant tweets among non-positive tweets with 70% relevant category confidence.".format(len(e_data_filtered)))

In [None]:
e_data_filtered.head(5)

## 5.4. Removing # and @ Signs

In [None]:
e_data_filtered['text'] = e_data_filtered['text'].apply(lambda x: re.sub(r'[@#]','',x))
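
For reference, this substitution removes only the `@` and `#` characters and leaves the handle or hashtag word in place. A short check on a made-up tweet:

```python
import re

sample = "@acme laid off #workers without notice"
stripped = re.sub(r"[@#]", "", sample)
print(stripped)
```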

In [None]:
e_data_filtered.head(3)

<a id="wks"></a>
# 6. Create a Watson Knowledge Studio Model

<img src="https://github.com/BKDuncan/Tweet-Analysis/blob/master/images/Step%206.PNG?raw=true" alt="step 6">

### Go to the instructions PDF and follow the steps to train a Knowledge Studio model.

In [None]:
 '''
     This cell doesn't do anything, but you can paste your model id and iam_apikey from step 4.2 here for safe keeping...
     [ model_id: ]
     [ iam_apikey: ]
 
'''

<a id="custom"></a>
# 7. Custom Model

<img src="https://github.com/BKDuncan/Tweet-Analysis/blob/master/images/Step%207.PNG?raw=true" alt="step 7">

Before running the cells, go back to your Discovery service and create a new collection for our custom model. We will upload the non-positive tweets we identified here for more rigorous analysis.

In [None]:
from ibm_watson import DiscoveryV1

discovery_custom = DiscoveryV1(version='2018-10-15',
                        iam_apikey="YOUR_API_KEY", # Enter your credentials (same as above, step 4.2) 
                        url="https://gateway.watsonplatform.net/discovery/api")

env_id_custom = 'YOUR_ENVIRONMENT_ID' # Enter your credentials (same as above, step 4.2)
col_id_custom = 'YOUR_COLLECTION_ID' # Enter your credentials (NEW)

conf_id_custom = 'YOUR_CONFIGURATION_ID' # Enter your credentials (NEW)

discovery_custom.set_default_headers({'x-watson-learning-opt-out': "true"})

## 7.1. Creating JSON for each tweet

In [None]:
# Save the tweets in JSON format
batch = "batch_one"

## make NumPy int64 values JSON serializable; anything else json cannot encode is written as a string
def default(o): 
    if isinstance(o, np.int64): return int(o)  
    else: return str(o)

## creating tweets directory    
if not os.path.isdir('filtered_tweets_'+ batch):
    os.mkdir('filtered_tweets_'+batch)

## creating json file per tweet
for i in e_data_filtered.index:
    with open(("./filtered_tweets_{}/tweet_{}.json".format(batch,i)),"w") as outfile:
        json.dump({"text": e_data_filtered.loc[i,"text"],"tid": e_data_filtered.loc[i,"tid"]}, outfile, default=default) 

## 7.2 Sending the Data to the Default Model

### Discovery won't let us apply our custom Watson Knowledge Studio model unless there is data in the collection, so we'll upload some tweets.

In [None]:
# Upload data to discovery (JSON formatted)
batch = "batch_one"
for ind, file_object in enumerate(os.listdir("./filtered_tweets_{}/".format(batch))):
    if file_object.endswith(".json"):
        # Read the file inside a context manager so the handle is closed after each upload
        with open(os.path.join("./filtered_tweets_{}/".format(batch),file_object)) as f:
            document_json = f.read()
        doc_id = "tweet" + str(ind)
        add_doc = discovery_custom.update_document(environment_id=env_id_custom, collection_id=col_id_custom,
                                                   document_id = doc_id,file = document_json,
                                                   file_content_type = "application/json", filename = file_object).get_result()
    else:
        pass

In [None]:
discovery_custom.get_collection(env_id_custom, col_id_custom).get_result()

## 7.3. Adding a Custom Model to Discovery

In [None]:
# Create custom configuration (a 'configuration' tells discovery which data it should enrich and how it should enrich it)
custom_conf = discovery_custom.get_configuration(environment_id=env_id_custom,configuration_id=conf_id_custom).get_result()
custom_conf['enrichments'][0]['destination_field']  = 'wks_enriched_text'

default_conf = {'destination_field': 'enriched_text','enrichment': 'natural_language_understanding',
                'options': {'features': {'categories': {},'concepts': {'limit': 8},
                                         'entities': {'emotion': False, 'limit': 50, 'sentiment': True},
                                         'sentiment': {'document': True}}},'source_field': 'text'}

custom_conf['enrichments'].append(default_conf)
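
The configuration returned by Discovery is an ordinary dictionary, so the edit above is a rename followed by a list append. The sketch below reproduces it on a hand-written stand-in (the `sample_conf` contents are an assumption about the shape, reduced to the keys the cell touches) and works on a copy so the original stays unchanged.

```python
import copy

# Stand-in for the dictionary returned by get_configuration(); reduced and assumed
sample_conf = {
    "name": "Default Configuration",
    "enrichments": [
        {"source_field": "text", "destination_field": "enriched_text",
         "enrichment": "natural_language_understanding"},
    ],
}

nlu_enrichment = {
    "source_field": "text",
    "destination_field": "enriched_text",
    "enrichment": "natural_language_understanding",
    "options": {"features": {"categories": {}, "sentiment": {"document": True}}},
}

updated = copy.deepcopy(sample_conf)
# The first enrichment will hold the custom-model output, so give it its own field
updated["enrichments"][0]["destination_field"] = "wks_enriched_text"
# Add the default NLU enrichment back, writing to the usual field
updated["enrichments"].append(nlu_enrichment)

print([e["destination_field"] for e in updated["enrichments"]])
```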

In [None]:
# Add our configuration to Watson Discovery
discovery_custom.update_configuration(environment_id=env_id_custom,configuration_id=conf_id_custom,name='twitter_conf3',enrichments=custom_conf['enrichments'])

# Tell the discovery collection to use our custom configuration on new documents
updated_collection = discovery_custom.update_collection(env_id_custom, collection_id=col_id_custom, configuration_id=conf_id_custom, name='Workshop Custom').get_result()
print(json.dumps(updated_collection, indent=2))

### Now that the custom model has been added, follow the instructions to add our trained Watson Knowledge Studio model to the custom configuration.

In [None]:
# Since the documents we just uploaded were enriched using the default model, we will delete them and upload again.
query = discovery_custom.query(environment_id=env_id_custom, collection_id=col_id_custom ,query='*.*', count=50)

for doc in query.result['results']:
    delete_doc = discovery_custom.delete_document(env_id_custom, col_id_custom, doc['id']).get_result()

## 7.4 Sending the Data to the Custom Model

In [None]:
# Add tweets to enrich with our custom model
batch = "batch_one"
for ind, file_object in enumerate(os.listdir("./filtered_tweets_{}/".format(batch))):
    if file_object.endswith(".json"):
        # Read the file inside a context manager so the handle is closed after each upload
        with open(os.path.join("./filtered_tweets_{}/".format(batch),file_object)) as f:
            document_json = f.read()
        doc_id = "tweet" + str(ind)
        add_doc = discovery_custom.update_document(environment_id=env_id_custom, collection_id=col_id_custom,
                                                   document_id = doc_id,file = document_json,
                                                   file_content_type = "application/json", filename = file_object).get_result()
    else:
        pass

In [None]:
discovery_custom.get_collection(env_id_custom, col_id_custom).get_result()

## 7.5. Analyzing Enriched Data from Custom Model

In [None]:
# Select fields to extract from enriched data
c_flds = ['tid','text',
          'enriched_text.sentiment.document.label','enriched_text.sentiment.document.score',
          'enriched_text.categories.label','enriched_text.categories.score',
          'enriched_text.entities.type','enriched_text.entities.text',
          'wks_enriched_text.entities.type','wks_enriched_text.entities.text']
c_flds = ','.join(c_flds)
c_flds

In [None]:
custom_data = pd.DataFrame(discovery_custom.query(environment_id=env_id_custom,
                                                  collection_id=col_id_custom,return_fields=c_flds,count = 100).get_result()['results'])

In [None]:
# Extract and process default categories
custom_data['wks_categories'] = custom_data['enriched_text'].apply(lambda d: d.pop('categories') if 'categories' in d else np.nan)
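# Tweets without categories hold NaN here, so fall back to an empty set instead of iterating over a float
custom_data['wks_categories'] = custom_data['wks_categories'].apply(lambda l: set([i['label'].split('/')[-1] for i in l]) if isinstance(l, list) else set())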

# Extract and process entities and custom categories
custom_data['def_entities'] = custom_data['enriched_text'].apply(lambda d: d.pop('entities') if 'entities' in d else np.nan)
custom_data['wks_entities'] = custom_data.loc[~custom_data['wks_enriched_text'].isna(),'wks_enriched_text'].apply(lambda d: d.pop('entities') if 'entities' in d else np.nan)

# Extract and process sentiment
custom_data['wks_sentiment'] = custom_data['enriched_text'].apply(lambda d: d.pop('sentiment')['document'])
custom_data['sent_score'] = custom_data['wks_sentiment'].apply(lambda x: x.pop('score'))
custom_data['sent'] = custom_data['wks_sentiment'].apply(lambda x: x.pop('label'))

# Identify main issue of each tweet
custom_data.loc[custom_data['wks_entities'].isnull(),'wks_entities'] = custom_data.loc[custom_data['wks_entities'].isnull(),'wks_entities'].apply(lambda x: [])
custom_data['issue'] = custom_data['wks_entities'].apply(lambda l: set([d['type'] for d in l]))

# Identify Location, Company, Organization in each tweet
ent_types = ['Company','Organization','Person','Location']
custom_data.loc[custom_data['def_entities'].isnull(),'def_entities'] = custom_data.loc[custom_data['def_entities'].isnull(),'def_entities'].apply(lambda x: [])

for t in ent_types:
    custom_data['wks_'+t] = custom_data['def_entities'].apply(lambda l: set([i['text'] for i in l if (i['type'] == t)]))

# Drop unnecessary columns
custom_data.drop(columns=['enriched_text','id','result_metadata','wks_enriched_text','wks_sentiment','def_entities','wks_entities'],inplace=True)
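
The entity handling above boils down to grouping entity texts by type, and treating a missing entity list as empty. A small sketch with invented entities (`texts_by_type` is a hypothetical helper):

```python
ENTITY_TYPES = ["Company", "Organization", "Person", "Location"]


def texts_by_type(entities, types=ENTITY_TYPES):
    """Map each entity type to the set of entity texts of that type."""
    entities = entities or []      # None (no enrichment) behaves like an empty list
    return {t: {e["text"] for e in entities if e["type"] == t} for t in types}


sample_entities = [
    {"type": "Company", "text": "Acme Corp"},
    {"type": "Location", "text": "Toronto"},
    {"type": "Location", "text": "Canada"},
]
grouped = texts_by_type(sample_entities)
print(sorted(grouped["Location"]))
print(texts_by_type(None)["Company"])
```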

In [None]:
custom_data.head(50)

## 7.6. Merging the Datasets and Deriving the Country for Each Location

In [None]:
merged_data = pd.merge(e_data_filtered[['tid','user','location','place','work_score','datetime']],custom_data,on=['tid'])
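
`pd.merge` performs an inner join by default, so only tweets whose `tid` appears in both tables survive this step. The sketch below imitates that behaviour with plain dictionaries instead of pandas, purely to show which rows are kept; the rows and the `inner_join` helper are invented for illustration.

```python
def inner_join(left_rows, right_rows, key):
    """Combine rows that share the same value for `key` (inner join)."""
    right_by_key = {}
    for row in right_rows:
        right_by_key.setdefault(row[key], []).append(row)
    joined = []
    for left in left_rows:
        for right in right_by_key.get(left[key], []):
            joined.append({**left, **right})
    return joined


filtered = [{"tid": 1, "user": "a"}, {"tid": 2, "user": "b"}]
custom = [{"tid": 2, "issue": "Safety"}, {"tid": 3, "issue": "Pay"}]
merged_rows = inner_join(filtered, custom, "tid")
print(merged_rows)
```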

In [None]:
# Convert column object types to corresponding data types
str_cols = ['wks_categories', 'issue', 'wks_Company','wks_Organization', 'wks_Person']
for c in str_cols:
    merged_data[c] = merged_data[c].apply(lambda s: ", ".join(s))
    
merged_data['datetime']= merged_data['datetime'].apply(lambda d: pd.to_datetime(d.split(" ")[0], format='%Y-%m-%d'))
merged_data.loc[merged_data['wks_Company'].isna(),'wks_Company'] = merged_data.loc[merged_data['wks_Company'].isna(),'wks_Organization'].fillna(value="")

In [None]:
# Identify country for each of the locations
locations = ['wks_Location','place','location']
for l in locations:
    merged_data[l] = merged_data[l].apply(lambda s: "".join(GeoText(str(s)).country_mentions.keys()))

In [None]:
merged_data.sort_values(by=['wks_Location','wks_Company','issue','place'],ascending=False,inplace=True)
merged_data.drop(columns=['wks_Organization','location'],inplace=True)

In [None]:
merged_data.head(3)

In [None]:
# Save the analyzed data to Cloud Object Storage (COS)
project.save_data('new_merged_data.csv',merged_data.to_csv(index=False),overwrite=True)

<a id="visualize"></a>
# 8. Visualizing the Results on World Map

<img src="https://github.com/BKDuncan/Tweet-Analysis/blob/master/images/Step%208.PNG?raw=true" alt="step 8">

### Now that your data is saved, let's go and build a dashboard!

# END